Conversation

@shlomi-noach
Contributor

Storyline: #205

WORK IN PROGRESS: resurrecting a migration after failure.
The idea is that gh-ost would routinely dump migration status/context. It would be possible for one gh-ost process to fail (e.g. having met critical-load) and for another gh-ost process to pick up from where the first left off.
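The idea can be sketched as a round-trip of the migration context (a toy model, not gh-ost's actual code; the field names below are hypothetical and far simpler than the real context):

```python
import json

# Hypothetical, simplified migration context; the real one has many more fields.
context = {
    "database": "test",
    "table": "sample",
    "iteration": 5000,
    "applied_binlog_file": "mysql-bin.000120",
    "applied_binlog_pos": 123456,
}

def export_context(ctx):
    """Serialize the migration context so a successor process can resume from it."""
    return json.dumps(ctx, sort_keys=True)

def import_context(dump):
    """Reconstruct the context inside a resurrecting process."""
    return json.loads(dump)

dump = export_context(context)
assert import_context(dump) == context  # a new process sees the same state
```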

Initial commits present exporting of migration context, with some shuffling & cleanup.

@shlomi-noach
Contributor Author

shlomi-noach commented Dec 20, 2016

TODO:

  • must not export passwords
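One way to honor that TODO is to strip credential fields before serializing; a resurrecting process would then have to be handed the password again on its own command line. A minimal sketch (key names are assumptions, not gh-ost's real field names):

```python
import json

SENSITIVE_KEYS = {"password", "cli_password"}  # hypothetical credential fields

def export_context(ctx):
    """Export everything except credentials, so the dump is safe to persist
    in the changelog table."""
    safe = {k: v for k, v in ctx.items() if k not in SENSITIVE_KEYS}
    return json.dumps(safe, sort_keys=True)

ctx = {"user": "gh-ost", "password": "s3cret", "table": "sample"}
dump = export_context(ctx)
assert "s3cret" not in dump
```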

@shlomi-noach
Contributor Author

shlomi-noach commented Dec 20, 2016

Export is to the changelog table. This ensures atomicity and durability of the write, assuming the changelog table is InnoDB. Notably, if the migrated table is MyISAM, so is the changelog table.

I'm fine stating that resurrection does not work on MyISAM, because MyISAM.
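The write itself can be sketched as a single INSERT into the changelog table; the `_<table>_ghc` name follows gh-ost's changelog naming convention, but the (hint, value) column layout here is a simplified assumption:

```python
def changelog_insert(table, hint, value):
    """Compose the INSERT that persists a context dump into the changelog
    table. As a single statement against an InnoDB table, the write is
    atomic and durable."""
    changelog = f"_{table}_ghc"  # gh-ost's changelog table naming
    return (
        f"insert /* gh-ost */ into `{changelog}` (hint, value) values (%s, %s)",
        (hint, value),
    )

sql, args = changelog_insert("sample", "context", '{"iteration": 5000}')
assert "_sample_ghc" in sql
assert args == ("context", '{"iteration": 5000}')
```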

@shlomi-noach
Contributor Author

5f25f74 makes for something that works! I'll need to iterate to see what has been overlooked, but basically we're getting there fast.

@shlomi-noach
Contributor Author

shlomi-noach commented Dec 23, 2016

A concern is not to rely on the streamer's last known position, because the streamer writes to a buffer (currently hard-coded to 100 events). Those buffered-but-unapplied events would be lost upon resurrection.

  • Instead, we should have the migration report the last applied event's coordinates.
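The streamed-vs-applied distinction can be illustrated with a toy buffer (sizes and coordinates here are illustrative only):

```python
from collections import deque

BUFFER_SIZE = 100  # events the streamer may hold that are not yet applied

streamed_coordinates = None   # last event read from the binary log
applied_coordinates = None    # last event actually applied to the ghost table
buffer = deque(maxlen=BUFFER_SIZE)

def stream_event(coords):
    """The streamer enqueues an event; persisting this position is unsafe."""
    global streamed_coordinates
    buffer.append(coords)
    streamed_coordinates = coords

def apply_next_event():
    """The applier dequeues and handles an event; this position is safe to persist."""
    global applied_coordinates
    coords = buffer.popleft()
    # ... apply the event to the ghost table here ...
    applied_coordinates = coords

for pos in range(1, 6):
    stream_event(("mysql-bin.000120", pos))
apply_next_event()
apply_next_event()

# Crashing now and resuming from streamed_coordinates would skip events 3..5.
assert streamed_coordinates == ("mysql-bin.000120", 5)
assert applied_coordinates == ("mysql-bin.000120", 2)
```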

Shlomi Noach added 6 commits December 23, 2016 15:24
StreamerBinlogCoordinates -> AppliedBinlogCoordinates
updating AppliedBinlogCoordinates when truly applied; no longer asking streamer for coordinates (because streamer's events can be queued, but not handled, a crash implies we need to look at the last _handled_ event, not the last _streamed_ event)
@shlomi-noach shlomi-noach deployed to production/ghost-mysql001 December 24, 2016 17:58 Active
@shlomi-noach shlomi-noach deployed to production/ghost-mysql001tb December 27, 2016 06:10 Active
@shlomi-noach shlomi-noach deployed to production/ghost-mysql001 December 28, 2016 11:36 Active
@shlomi-noach shlomi-noach deployed to production/ghost-mysql001 December 28, 2016 12:24 Active
@shlomi-noach shlomi-noach deployed to production/ghost-mysql001 December 28, 2016 21:06 Active
@shlomi-noach shlomi-noach deployed to production/ghost-mysql001 December 28, 2016 21:17 Active
@shlomi-noach shlomi-noach deployed to production/ghost-mysql001 December 29, 2016 08:24 Active
@shlomi-noach shlomi-noach deployed to production/ghost-mysql001 December 29, 2016 08:27 Active
@shlomi-noach shlomi-noach mentioned this pull request Dec 29, 2016
@shlomi-noach shlomi-noach deployed to production/ghost-mysql001 December 30, 2016 06:03 Active
@tomkrouper

Off-issue, you were mentioning gh-ost was having checksum issues with resurrections. When you mentioned that I was thinking: could it be related to the fact that we have two things going on, the backlog and the iteration of inserts? I hope that makes sense. Nonetheless, it was something that popped into my head that I hoped might help when you get back to this. (Not fully understanding the code changes, this might already be something you're handling.)

@shlomi-noach
Contributor Author

@tomkrouper the conjecture is as follows:

  • assuming gh-ost breaks while copying rows 5,000-5,100
  • and while reading mysql-bin.000120 at position 123456

it should be OK to resume execution

  • start with copying rows 4,300-4,400 (way before the point of breakage)
  • start with reading binary log mysql-bin.000120 at position 121234 (way before the point of breakage)

this is the conjecture's logic:

  • re-copying the same rows just overwrites existing rows (or adds rows that weren't there before!)
  • re-applying the binary logs is an idempotent action

I find it a bit difficult right now to substantiate these claims, but I believe them to be true.
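A toy model of the first claim, treating the ghost table as a keyed map where both row-copy and binlog apply upsert by primary key (this is a sketch of the idea, not gh-ost's actual SQL):

```python
# Model the ghost table as {pk: row}. Re-running a range is safe because
# writing the same row under the same primary key is a no-op overwrite.
def copy_rows(ghost, source, lo, hi):
    for pk in range(lo, hi + 1):
        if pk in source:
            ghost[pk] = source[pk]  # overwrite of an identical row is harmless

source = {pk: f"row-{pk}" for pk in range(4300, 5101)}
ghost = {}

copy_rows(ghost, source, 4300, 5100)  # first attempt; suppose we crash after this
copy_rows(ghost, source, 4300, 4400)  # resume from way before the breakage
copy_rows(ghost, source, 4400, 5100)  # continue; overlap does no harm

assert ghost == {pk: source[pk] for pk in range(4300, 5101)}
```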

But then, of course, tests are failing...

@Xopherus

@shlomi-noach are there any plans to revisit this feature? I'm looking at gh-ost again and one of the concerns my team has is that we have some very large tables that can take days, if not a week, to copy. If the process were to crash in the middle, we'd have a lot of wasted effort, especially since we have to slowly drain the _gho table to prevent the dreaded global metadata lock when dropping it. This would be extremely helpful for us!

@shlomi-noach
Contributor Author

@Xopherus this isn't on the near-term roadmap.
FWIW, we likewise run week-long, or in one case even 22-day-long, migrations. We use

-critical-load-hibernate-seconds=3600

such that hitting critical-load doesn't cause gh-ost to bail out.

I understand the stress involved with running a week long migration. Our history shows those migrations do not break, hence the Resurrection feature is not urgent for us to implement.

@Xopherus

Xopherus commented Jul 7, 2018

Thanks for the advice @shlomi-noach! Appreciate the wisdom - I've found that tuning gh-ost is one of the challenges because the feedback cycles are so long. I'll have to try that parameter and let you know how it goes.

@shlomi-noach
Contributor Author

I've found that tuning gh-ost is one of the challenges because the feedback cycles are so long

@Xopherus Could you please elaborate on that? I'm not sure I understand.

@Xopherus

Oh I just mean that if your migrations can take multiple hours or days, it can be tricky to tune parameters (e.g. critical load thresholds or lock cutover timeouts) because it takes longer to experiment. Fortunately we've gotten solid advice from you and others here to help guide us in the right direction.

@daniel-nichter

Hi @shlomi-noach :-) I think "resurrect" is not the best term. It's not a standard technical term. Even doc/command-line-flags.md has to clarify: "It is possible to resurrect/resume a failed migration". When people think, "Can I resume an osc?", they'll look for and Google with that term. Imho, "resurrect" will never cross people's minds. By contrast, everyone knows what "resume" (and its reciprocals "suspend" or "pause") means. I'd also argue that it's not technically descriptive or intention-revealing. A dead body can be resurrected, and I get the joke with the app being called "ghost", but it raises the question: what does it mean to resurrect a program? My last argument: for non-native English speakers/readers, these issues are compounded by uncommon words in a technical context.

I'd vote for pause/resume or start/stop.

@shlomi-noach
Contributor Author

Thank you @daniel-nichter

@rakhi-s

rakhi-s commented Sep 9, 2020

Bumping this feature request to check whether there have been any changes to make this feature available?

@tomkrouper

Bumping this feature request to check whether there have been any changes to make this feature available?

This code is fairly old and there are a bunch of conflicting files at this point. We don't have any immediate plans to work on this, but I do agree this would be a good feature to have and if anyone would like to continue the work, we'd love the community contribution.

@meiji163
Contributor

superseded by #1595

@meiji163 meiji163 closed this Jan 12, 2026